We can do a lot in R: statistical analysis, plotting graphs and figures, writing technical documentation, building websites, and much more. Let's explore.
https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
According to recent editorials, the reproducibility crisis is still ongoing:
Reality check on reproducibility
1,500 scientists lift the lid on reproducibility
Nature, May 2016
| https://rladies.org | https://www.r-bloggers.com/ |
|---|---|
Where and how to run R?
R can be run from the command line or through a graphical user interface (GUI). In this session, we will use the RStudio GUI.
To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.
Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object (we also refer to objects as variables).
Formal data types -
- character: "a", "swc"
- numeric: 2, 15.5
- logical: TRUE, FALSE
- integer: 2L (the L tells R to store this as an integer)
- complex: 1+4i (complex numbers with real and imaginary parts)

A vector is the most common and basic data structure in R and is pretty much the workhorse of R.
Using TRUE and FALSE will create a vector of mode logical:
While using quoted text will create a vector of mode character:
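A minimal sketch of both cases, using illustrative values (the names `y` and `z` are assumptions, not necessarily the ones used in the session):

```r
# Logical vector: built from TRUE/FALSE values
y <- c(TRUE, TRUE, FALSE, FALSE)

# Character vector: built from quoted text
z <- c("Sarah", "Tracy", "Jon")

mode(y)  # "logical"
mode(z)  # "character"
```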
The functions typeof(), length() provide useful information about your vectors and R objects in general.
## [1] "character"
## [1] 3
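Output like the two lines above would come from a three-element character vector. A small sketch, assuming an illustrative vector `z`:

```r
# Illustrative character vector with three elements (values are assumed)
z <- c("Sarah", "Tracy", "Jon")

typeof(z)  # storage type of the object: "character"
length(z)  # number of elements: 3
```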
The function c() (for combine) can also be used to add elements to a vector.
## [1] "Sarah" "Tracy" "Jon" "Annette"
## [1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
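One way to arrive at the two results above, assuming the starting vector holds the first three names:

```r
z <- c("Sarah", "Tracy", "Jon")  # assumed starting vector

z <- c(z, "Annette")  # add an element at the end
z

z <- c("Greg", z)     # add an element at the beginning
z
```

Note that `c()` does not modify a vector in place; the result has to be assigned back to a name to be kept.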
You can create vectors as a sequence of numbers.
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4
## [16] 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [31] 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4
## [46] 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [61] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [76] 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9
## [91] 10.0
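Sequences like the two shown above can be produced with the `:` operator and with `seq()`. A sketch (the variable names are assumptions):

```r
# Integers from 1 to 10
series <- 1:10
series

# From 1 to 10 in steps of 0.1
fine_series <- seq(1, 10, by = 0.1)
length(fine_series)  # 91 elements, as the [91] index in the printout indicates
```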
R supports missing data in vectors. Missing values are represented as NA (Not Available) and can be used in all the vector types covered in this lesson:
The function is.na() indicates the elements of the vectors that represent missing data, and the function anyNA() returns TRUE if the vector contains any missing values:
## [1] FALSE TRUE FALSE FALSE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE
## [1] TRUE
## [1] FALSE
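The four results above are consistent with one vector that has two missing values and one that has none. A sketch with assumed values:

```r
x <- c("a", NA, "c", "d", NA)    # two missing values
y <- c("a", "b", "c", "d", "e")  # no missing values

is.na(x)  # one logical per element: TRUE where the value is missing
is.na(y)

anyNA(x)  # a single TRUE/FALSE for the whole vector
anyNA(y)
```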
In R, matrices are an extension of numeric or character vectors: they are atomic vectors with dimensions, that is, rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
## [1] 2 2
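Output like the above can be produced by creating an empty matrix and asking for its dimensions. A minimal sketch (the name `m` is an assumption):

```r
# A 2 x 2 matrix with no data supplied is filled with NA
m <- matrix(nrow = 2, ncol = 2)
m

dim(m)  # number of rows, then number of columns
```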
Content of a matrix -
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Note - Matrices in R are filled column-wise.
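The column-wise filling can be seen by building a matrix from the numbers 1 to 6. A sketch (the name `m` is an assumption):

```r
# Values go down the first column, then the second, then the third
m <- matrix(1:6, nrow = 2, ncol = 3)
m

m[1, ]  # first row:  1 3 5
m[2, ]  # second row: 2 4 6
```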
You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 11 12 13
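A call along these lines gives the row-wise result above. It is adapted from the example in `?matrix`, without the dimension names that the help page adds:

```r
# byrow = TRUE fills the matrix one row at a time
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
mdat
```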
A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.
Some additional information on data frames:
- Usually created by read.csv() and read.table(), i.e. when importing the data into R.
- Can also be created with the data.frame() function.
- Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.

To create data frames by hand:
## id x y
## 1 a 1 11
## 2 b 2 12
## 3 c 3 13
## 4 d 4 14
## 5 e 5 15
## 6 f 6 16
## 7 g 7 17
## 8 h 8 18
## 9 i 9 19
## 10 j 10 20
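A data frame matching the printout above can be built with `data.frame()`. A sketch (the name `dat` follows the `nrow(dat)` wording used earlier; the exact call is an assumption):

```r
# Three columns of equal length: one character, two integer
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
```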
Useful Data Frame Functions
- head() - shows first 6 rows
- tail() - shows last 6 rows
- dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of data frame: name, type and preview of data in each column
- names() or colnames() - both show the names attribute for a data frame
- sapply(dataframe, class) - shows the class of each column in the data frame

See that it is actually a special list:
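A sketch of that check, together with a few of the functions listed above, on a small hand-made data frame (`dat` and its contents are assumptions for illustration):

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

head(dat)           # first 6 rows
dim(dat)            # 10 rows, 3 columns
str(dat)            # name, type and preview of each column
sapply(dat, class)  # class of each column

is.list(dat)        # TRUE: a data frame is a list of equal-length columns
class(dat)          # "data.frame"
```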
Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix).
## [1] 11
As data frames are also lists, it is possible to refer to columns (which are elements of such a list) using the list notation, i.e. either double square brackets or a $.
## [1] 11 12 13 14 15 16 17 18 19 20
## [1] 11 12 13 14 15 16 17 18 19 20
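The three notations can be compared side by side. A sketch, assuming a data frame `dat` with columns `id`, `x` and `y` like the one printed earlier:

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

dat[1, 3]    # matrix-style: row 1, column 3 -> 11
dat[["y"]]   # list-style: the whole y column as a vector
dat$y        # shorthand for the same column
```

The last two forms return the same vector; `[[ ]]` is handy when the column name is stored in a variable.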
The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to the diversity of data types they can contain.
| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1-D | atomic vector | list |
| 2-D | matrix | data frame |
- Reading: read.table(), read.csv(), read.delim()
- Writing: write.table(), write.csv()

Let's consider this example: we are investigating the animal species diversity and weights found within plots at our study site. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
| Column | Description |
|---|---|
| record_id | Unique id for the observation |
| month | month of observation |
| day | day of observation |
| year | year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code |
| sex | sex of animal (“M”, “F”) |
| hindfoot_length | length of the hindfoot in mm |
| weight | weight of the animal in grams |
| genus | genus of animal |
| species | species of animal |
| taxa | e.g. Rodent, Reptile, Bird, Rabbit |
| plot_type | type of plot |
Before we even start the analysis, we need to be sure where the data are located on our hard drive.
The file.exists() function does exactly what it says on the tin!
## [1] TRUE
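The check-then-read pattern can be sketched without the real survey file. Here a tiny CSV is written to a temporary location first; the path, the object names and the three rows are placeholders, not the actual dataset:

```r
# Stand-in for the real file: a tiny CSV in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,species_id,weight",
             "1,NL,",
             "2,DM,40"), csv_path)

file.exists(csv_path)  # TRUE once the file has been written

# Read the file into a data frame and look at the first rows
surveys_demo <- read.csv(csv_path)
head(surveys_demo)
```

Blank fields in numeric columns are read as `NA`, which is why missing weights show up as `NA` in a printout like the one that follows.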
## record_id month day year plot_id species_id sex hindfoot_length weight
## 1 1 7 16 1977 2 NL M 32 NA
## 2 72 8 19 1977 2 NL M 31 NA
## 3 224 9 13 1977 2 NL NA NA
## 4 266 10 16 1977 2 NL NA NA
## 5 349 11 12 1977 2 NL NA NA
## 6 363 11 12 1977 2 NL NA NA
## genus species taxa plot_type
## 1 Neotoma albigula Rodent Control
## 2 Neotoma albigula Rodent Control
## 3 Neotoma albigula Rodent Control
## 4 Neotoma albigula Rodent Control
## 5 Neotoma albigula Rodent Control
## 6 Neotoma albigula Rodent Control
Get to know a function
Check the dimensions:
## [1] 13
## [1] 34786
## [1] 34786 13
The names of the columns are automatically assigned:
## [1] "record_id" "month" "day" "year"
## [5] "plot_id" "species_id" "sex" "hindfoot_length"
## [9] "weight" "genus" "species" "taxa"
## [13] "plot_type"
## 'data.frame': 34786 obs. of 13 variables:
## $ record_id : int 1 72 224 266 349 363 435 506 588 661 ...
## $ month : int 7 8 9 10 11 11 12 1 2 3 ...
## $ day : int 16 19 13 16 12 12 10 8 18 11 ...
## $ year : int 1977 1977 1977 1977 1977 1977 1977 1978 1978 1978 ...
## $ plot_id : int 2 2 2 2 2 2 2 2 2 2 ...
## $ species_id : chr "NL" "NL" "NL" "NL" ...
## $ sex : chr "M" "M" "" "" ...
## $ hindfoot_length: int 32 31 NA NA NA NA NA NA NA NA ...
## $ weight : int NA NA NA NA NA NA NA NA 218 NA ...
## $ genus : chr "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
## $ species : chr "albigula" "albigula" "albigula" "albigula" ...
## $ taxa : chr "Rodent" "Rodent" "Rodent" "Rodent" ...
## $ plot_type : chr "Control" "Control" "Control" "Control" ...
## [1] "albigula" "merriami" "flavus" "eremicus"
## [5] "spectabilis" "penicillatus" "hispidus" "torridus"
## [9] "ordii" "sp." "spilosoma" "leucogaster"
## [13] "megalotis" "audubonii" "maniculatus" "harrisi"
## [17] "bilineata" "melanocorys" "squamata" "fulvescens"
## [21] "taylori" "montanus" "ochrognathus" "baileyi"
## [25] "brunneicapillus" "chlorurus" "fulviventer" "intermedius"
## [29] "leucopus" "viridis" "gramineus" "savannarum"
## [33] "leucophrys" "scutalatus" "undulatus" "fuscus"
## [37] "tereticaudus" "tigris" "clarki" "uniparens"
Let's learn indexing
# first element in the first column of the data frame (as a vector)
surveys[1, 1]
# first element in the 6th column (as a vector)
surveys[1, 6]
# first column of the data frame (as a vector)
surveys[, 1]
# first column of the data frame (as a data.frame)
surveys[1]
# The whole data frame, except the first column
surveys[, -1]
# first five rows, all columns
surveys[1:5, ]
By name
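Indexing by name works with the same bracket forms, using column names instead of positions. A sketch on a small stand-in data frame (`surveys_demo` and its values are assumptions, chosen to resemble the survey columns):

```r
# Small stand-in for the surveys data frame (values are illustrative)
surveys_demo <- data.frame(
  record_id  = c(1, 72, 224),
  species_id = c("NL", "NL", "DM"),
  weight     = c(NA, NA, 40)
)

surveys_demo["species_id"]    # one column, returned as a data.frame
surveys_demo[, "species_id"]  # one column, returned as a vector
surveys_demo[["species_id"]]  # same vector, list-style
surveys_demo$species_id       # same vector, $ shorthand

# first two rows of two columns selected by name
surveys_demo[1:2, c("record_id", "weight")]
```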
Check hight frequency
library(ggplot2)
ggplot(data = surveys, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.1, aes(color = species))
## Warning: Removed 4048 rows containing missing values (geom_point).
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Warning: 129 failed to parse.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1977-07-16" "1984-03-12" "1990-07-22" "1990-12-15" "1997-07-29" "2002-12-31"
## NA's
## "129"
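A "failed to parse" warning and the `NA` count in the summary are what one would expect when some year, month and day combinations are not real calendar dates (for example 31 April). The sketch below reproduces that effect on three made-up rows. It uses base R's `as.Date()` as a stand-in for the lubridate parser the session appears to rely on, so the exact call is an assumption:

```r
# Made-up rows: the second and third are impossible dates
demo <- data.frame(year  = c(1977, 2000, 2000),
                   month = c(7, 4, 9),
                   day   = c(16, 31, 31))

# Glue the three columns into "year-month-day" strings
date_strings <- paste(demo$year, demo$month, demo$day, sep = "-")

# With an explicit format, dates that do not exist become NA
demo$date <- as.Date(date_strings, format = "%Y-%m-%d")
demo$date

sum(is.na(demo$date))  # how many rows could not be converted
summary(demo$date)
```

Inspecting the rows where the date is `NA` is a quick way to find which records carry an invalid day.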
Created and Maintained by Sangram Keshari Sahu
Rmarkdown Template used from Rmdplates package
Licensed under CC-BY 4.0
Source Code At GitHub